200819 Expt.295: RNP Enrichment Under Different pH Conditions

NB: This demonstration analysis was performed on unpublished development data. This experiment was designed to investigate the effect of pH on driving RNP enrichment to the interphase of a phenol chloroform preparation.

Aims:
A. To test the effects of pH manipulation on concentrating RBP-RNA complexes to the interphase of a customised Phenol-Chloroform Cocktail
B. To compare these results with past studies

Method:
See associated writeup for Expt.295

1. Import Modules and Files

Custom Functions
jwrangle.importMixedFiles( )

I generally import everything I MIGHT use at the start and set up pathing using the OS-agnostic pathlib.

In [1]:
#### File utilities
import os
import pandas as pd
from pathlib import Path
from imp import reload

#### Data Wrangling
import copy
import numpy as np

#### RBP Suite Modules
import jwrangle
import jvis
import jinspect
import jtest
import jweb

#### Sequence Tools
from Bio import SeqIO

#### Graphical Packages
import upsetplot as upset
import seaborn as sns
import matplotlib.pyplot as plt
import altair as alt

#### define working directories
cwd  = Path(os.getcwd())
base_path = Path(os.path.join(*cwd.parts[:cwd.parts.index('experiments')]))

#### MaxQuant proteinGroups & evidence files
MQ_folder = jwrangle.importMixedFiles(cwd / 'MaxQuant')
MQ_folder.keys()

pGroups = MQ_folder['proteinGroups.txt']
evidence = MQ_folder['evidence.txt']

#### Inspect MQ setup
MQ_folder['parameters.txt'].head(9)
C:\Users\smith.j\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3242: DtypeWarning: Columns (4,5,55,56,58,63) have mixed types.Specify dtype option on import or set low_memory=False.
  if (await self.run_code(code, result,  async_=asy)):
Out[1]:
Parameter Value
0 Version 1.6.7.0
1 User name smith.j
2 Machine name MSCYPHER-253
3 Date of writing 07/06/2020 15:05:00
4 Include contaminants True
5 PSM FDR 0.01
6 PSM FDR Crosslink 0.01
7 Protein FDR 0.01
8 Site FDR 0.01

2. Metadata Creation

Custom Functions
jwrangle.MQ_writeMetadata( )

Metadata tabulates the test conditions for ALL experiments that shared the same MQ search and thuss all experiments that comprise the MQ outputs. Metadata can also be done in a spreadsheet program. Here, I have created my metadata programmatically instead (I simply find this easier).
The metadata table gives users the opportunity to rename samples and define the experimental parameters for the data. This task can be expecially complex for MaxQuant because a unified output is generated even if distinctly separate experiments are searched as a batch and with different parameters applied.
The function jwrangle.MQ_writeMetadata( ) will take a metadata table, rename all samples in the proteinGroups and evidence files, assign alternative filenames, and save new copies to be used in future analyses.

In [1]:
#### Inspect column names
colnames = list(pGroups.columns.values)

#### Derive experiment names as a list
#### The instrument and MQ have assigned some irritating notation to the samples
experiment_names = []
for i in colnames:
    if 'Intensity ' in i:
        experiment_names.append(i.replace('Intensity ', ''))

#### Create a list of associated conditions
#### These samples were mislabeled 'NC' and 'PC' (the latter is often short for PAR-CL, but these samples are 254nm crosslinked)
#### The pH on these samples needs to be corrected)
condition =  []
for i in experiment_names:
    trim = i.split('-')[0]
    pref = trim.split('_')[2] + '_' + trim.split('_')[3]
    if 'NC' in pref:
        pref = pref.replace('NC','nCL')
    else:
        pref = pref.replace('PC','254')
    
    if 'pH4' in pref:
        pref = pref.replace('pH4OdT','pH4.5')
    elif 'pH5' in pref:
        pref = pref.replace('pH5OdT','pH5.5')
    elif 'pH8' in pref:
        pref = pref.replace('pH8OdT','pH7.8')
          
    condition.append(pref)

#### Create a list of associated replicate identifiers
replicate = []
for i in experiment_names:
    replicate.append(i.split('-')[-1])

#### Create a more reader friendly list of sample names
samples =  []
for i in experiment_names:
    pref = i.split('_')[0]
    name_clean = i.replace(pref+'_','')
    if len(name_clean.split('_')[0])==1:
        newprefix = '0' + name_clean.split('_')[0]
        name_clean = newprefix + '_' + name_clean.split('_',1)[1] 
    
    if 'NC' in name_clean:
        name_clean = name_clean.replace('NC','nCL')
    else:
        name_clean = name_clean.replace('PC','254')
    
        
    if 'pH4' in name_clean:
        name_clean = name_clean.replace('pH4OdT','pH4.5')
    elif 'pH5' in name_clean:
        name_clean = name_clean.replace('pH5OdT','pH5.5')
    elif 'pH8' in name_clean:
        name_clean = name_clean.replace('pH8OdT','pH7.8')
    
    samples.append(name_clean)
    
#### The MQ_groups dictionary assigns each condition to its MQ group parameter
MQ_groups = {'pH4.5':['pH4.5_nCL','pH4.5_254'], 
             'pH5.5':['pH5.5_nCL','pH5.5_254'],
             'pH7.8':['pH7.8_nCL','pH7.8_254']}

#### We then use the condition list to create an MQ_groups list
Grp_parameters = []
for label in condition:
    for key, value in MQ_groups.items():
        if label in value:
            Grp_parameters.append(key)
            
expt_df = pd.DataFrame(
    {'experiment': experiment_names,
     'condition': condition,
     'replicate': replicate,
     'sample':samples,
     'measure':['Intensity']*len(samples),                  # adding this column allows our metadata file to be compatible with Proteus
     'MQgroups':Grp_parameters
    })

expt_df
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-aeece31b2618> in <module>
      1 #### Inspect column names
----> 2 colnames = list(pGroups.columns.values)
      3 
      4 #### Derive experiment names as a list
      5 #### The instrument and MQ have assigned some irritating notation to the samples

NameError: name 'pGroups' is not defined
In [9]:
#### Write the modified proteins Groups and evidence files to a local folder
MQ_expt295 = jwrangle.MQ_writeMetadata(pGroups, evidence, expt_df, 'e295', cwd)
metadata.csv created in metadata folder
e295_proteinGroups_metalabeled.txt created in MaxQuant folder
e295_evidence_metalabeled.txt created in MaxQuant folder

3. Re-Annotate the MaxQuant pGroups with stable Gene IDs

Functions
jweb.mapAnyID( )
jwrangle.importMixedFiles( )

MaxQuant does a good job of assigning a Gene name to each protein group. Presumably these gene names come from the FASTA. However:

  • Sometimes it fails to find a gene name
  • Sometimes it will assign an ID that is not a gene or include out-of-place characters
  • It doesn't always seem to be consistent
  • If the gene name originates from the FASTA then repeating the MQ search with an updated FASTA is the only way to update the gene IDs.
  • Use of a mapping service will standardise the ID conversion practices between my datasets and those of others, including RNA-Seq.

To avoid these problems we will remap the Majority protein IDs to ENTREZ gene IDs. jweb.mapAnyID( ) will retrieve all possible genes for each protein group, and will also select a primary ID to singularly represent the group by a consistent method. This is a very flexible function, see help( ) for further explanation. From this point, the MQ 'Gene names' column will no longer be necessary. This function can also handle ID mapping to and from almost any convention.

Ensuring our proteins have a consistent gene naming strategy is essential for inter-experiment comparison and the later use of set methods. It also creates a standard that can be applied for accurately mapping RNA-Seq results and thus aid in future mapping of protein-RNA partners.

In [2]:
#### If not already loaded, read in the metadata-adjusted files
metadata = pd.read_csv(cwd / 'metadata' / 'e295_metadata.csv', index_col = 0)
pGroups = pd.read_csv(cwd / 'MaxQuant' / 'e295_proteinGroups_metalabeled.txt', delimiter = '\t')
evidence = pd.read_csv(cwd / 'MaxQuant' / 'e295_evidence_metalabeled.txt', delimiter = '\t')
C:\Users\smith.j\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3051: DtypeWarning: Columns (4,5,55,56,58,63) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [3]:
#### Dynamically remap gene names in our proteinGroups file and save a copy
# pGroups_remap = jweb.mapAnyID_gPro(pGroups['Majority protein IDs'].tolist(), splitstr = [';', '-'], geneProductType = 'protein', 
#                               gConvertOrganism = 'hsapiens', gConvertTarget = 'ENTREZGENE', writetopath = [cwd, 'pGroups_remap'], writeTargetsAsList = 'NO')
In [4]:
#### If not already loaded, read in the remapped proteinGroups file
pGroups_remap = jwrangle.importMixedFiles(cwd / 'downloads' / 'pGroups_remap', dropSuffix = 'yes')
pGroups_remap.keys()
Out[4]:
dict_keys(['id_map', 'query_map'])
In [5]:
#### jwrangle.importMixedFiles() returns a dictionary where keys = files. We want the 'id_map' table created by jweb.mapAnyID_gPro().
#### We'll rename the Query column and drop duplicates so the table can be merged with our proteinGroups table.
id_map = pGroups_remap['id_map'].rename(columns={'Query': 'Majority protein IDs'}).drop_duplicates()
id_map.head(2)
Out[5]:
Majority protein IDs ENTREZGENE_gPro all ENTREZGENE_gPro primary ENTREZGENE_gPro name UNIPROT_gPro status
0 A0A024R4E5;Q00341;Q00341-2;H0Y394 HDLBP;None HDLBP high density lipoprotein binding protein [Sour... SWISSPROT
1 A0A024R4M0;P46781;B5MCT8;C9JM19 RPS9 RPS9 ribosomal protein S9 [Source:HGNC Symbol;Acc:H... SWISSPROT
In [6]:
#### Now use merge to add these new columns to our proteinGroups table
pGroups_map = pd.merge(pGroups, id_map, on='Majority protein IDs', how='left')

#### Check the tables are merged by viewing column elements from each.
pGroups_map[id_map.columns.tolist() + ['Peptide IDs']].head(2)
Out[6]:
Majority protein IDs ENTREZGENE_gPro all ENTREZGENE_gPro primary ENTREZGENE_gPro name UNIPROT_gPro status Peptide IDs
0 A0A024R4E5;Q00341;Q00341-2;H0Y394 HDLBP;None HDLBP high density lipoprotein binding protein [Sour... SWISSPROT 25;321;749;763;961;1370;1398;1508;1663;1692;17...
1 A0A024R4M0;P46781;B5MCT8;C9JM19 RPS9 RPS9 ribosomal protein S9 [Source:HGNC Symbol;Acc:H... SWISSPROT 2663;2664;4476;4677;4678;4818;5615;5914;6005;6...

4. Review Contaminants by Sample

Functions
jinspect.MQ_getContaminants( )
MQ_getContaminants_sbplot( )
jwrangle.importMixedFiles( )

We can extract the conaminants from our proteinGroups file using jinspect.MQ_getContaminants( ). These extracted table will return log2(iBAQ values).
Contaminants can then be reviewed with _MQ_getContaminantssbplot( ).

In [7]:
#### Extract contaminants
contaminants = jinspect.MQ_getContaminants(pGroups_map, metadata)
contaminants.head(2)
Out[7]:
01_pH4.5_nCL-A 10_pH5.5_nCL-D 11_pH5.5_nCL-E 12_pH5.5_nCL-F 13_pH7.8_nCL-A 14_pH7.8_nCL-B 15_pH7.8_nCL-C 16_pH7.8_nCL-D 17_pH7.8_nCL-E 18_pH7.8_nCL-F ... 33_pH7.8_254-C 34_pH7.8_254-D 35_pH7.8_254-E 36_pH7.8_254-F 04_pH4.5_nCL-D 05_pH4.5_nCL-E 06_pH4.5_nCL-F 07_pH5.5_nCL-A 08_pH5.5_nCL-B 09_pH5.5_nCL-C
Protein ID: Gene
ENSBTAP00000038253: nan 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
CON__P00761: nan 25.804274 23.886416 27.46334 26.498277 24.591379 24.724216 26.948488 25.729178 23.444933 26.222748 ... 30.401516 30.038262 30.391205 30.435692 29.082636 30.063168 28.112969 25.805505 23.790345 24.178433

2 rows × 36 columns

In [8]:
#### Sort the metadata into a more intuitive order
metadata_sort = metadata.sort_values(by = ['MQgroups','sample'])

#### Visually inspect contaminants  
jvis.MQ_getContaminants_sbplot(contaminants, metadata_sort, width = 1, length = 2, layout = 'individual')
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>

Results: UNUSUAL

The contaminant profiles of pH 4/8 are similar with a high concentration falling in their respective 254 fractions.
Strangely the contaminant profile of all nCL and 254 samples in the pH5 fraction are similar.
Not sure what to make of this with regard to sample processing but it can be reviewed in the repeat of this experiment, expt.297

5. Assess Digestion Efficiency

Functions
jinspect.MQ_getMissedCleavages( )
jvis.CommonPalettesAsHex
jvis.BarPlotByGroup_sbplot( )

Assessing missed cleavages is an essential metric for understanding the quality of the tryptic digestion. This data is recorded in the evidence file.
_jinspect.MQgetMissedCleavages( ) will return a long form data table that can easily be used for plotting.
The dictionary jvis.CommonPalettesAsHex contains a number of palettes that are common to both matplotlib and ggplot (from R). These are provided to ensure consistency is easy to achieve across both languages.
We'll plot the missed cleavages with the generic function jvis.BarPlotByGroup_sbplot( )

In [9]:
#### Extract the missed cleavage data into a long form table for plotting
MissedCleavages = jinspect.MQ_getMissedCleavages(evidence, metadata, drop_contaminants = True)
MissedCleavages.sort_values(by=['expt','sample'], inplace = True)
MissedCleavages
Out[9]:
sample % Missed Cleavages group expt
32 01_pH4.5_nCL-A 12 pH4.5_nCL pH4.5
27 02_pH4.5_nCL-B 7 pH4.5_nCL pH4.5
35 03_pH4.5_nCL-C 5 pH4.5_nCL pH4.5
7 04_pH4.5_nCL-D 15 pH4.5_nCL pH4.5
5 05_pH4.5_nCL-E 9 pH4.5_nCL pH4.5
11 06_pH4.5_nCL-F 13 pH4.5_nCL pH4.5
4 19_pH4.5_254-A 24 pH4.5_254 pH4.5
0 20_pH4.5_254-B 21 pH4.5_254 pH4.5
33 21_pH4.5_254-C 23 pH4.5_254 pH4.5
16 22_pH4.5_254-D 22 pH4.5_254 pH4.5
26 23_pH4.5_254-E 23 pH4.5_254 pH4.5
14 24_pH4.5_254-F 23 pH4.5_254 pH4.5
28 07_pH5.5_nCL-A 23 pH5.5_nCL pH5.5
25 08_pH5.5_nCL-B 23 pH5.5_nCL pH5.5
10 09_pH5.5_nCL-C 20 pH5.5_nCL pH5.5
15 10_pH5.5_nCL-D 21 pH5.5_nCL pH5.5
22 11_pH5.5_nCL-E 19 pH5.5_nCL pH5.5
8 12_pH5.5_nCL-F 30 pH5.5_nCL pH5.5
13 25_pH5.5_254-A 21 pH5.5_254 pH5.5
3 26_pH5.5_254-B 23 pH5.5_254 pH5.5
18 27_pH5.5_254-C 27 pH5.5_254 pH5.5
23 28_pH5.5_254-D 26 pH5.5_254 pH5.5
19 29_pH5.5_254-E 26 pH5.5_254 pH5.5
34 30_pH5.5_254-F 26 pH5.5_254 pH5.5
9 13_pH7.8_nCL-A 14 pH7.8_nCL pH7.8
29 14_pH7.8_nCL-B 13 pH7.8_nCL pH7.8
1 15_pH7.8_nCL-C 14 pH7.8_nCL pH7.8
21 16_pH7.8_nCL-D 12 pH7.8_nCL pH7.8
12 17_pH7.8_nCL-E 11 pH7.8_nCL pH7.8
24 18_pH7.8_nCL-F 12 pH7.8_nCL pH7.8
6 31_pH7.8_254-A 26 pH7.8_254 pH7.8
20 32_pH7.8_254-B 26 pH7.8_254 pH7.8
30 33_pH7.8_254-C 26 pH7.8_254 pH7.8
17 34_pH7.8_254-D 25 pH7.8_254 pH7.8
31 35_pH7.8_254-E 24 pH7.8_254 pH7.8
2 36_pH7.8_254-F 26 pH7.8_254 pH7.8
In [10]:
### Select a colour palette  
cpal = jvis.CommonPalettesAsHex

set2_paired = []
for i in cpal['Set2_qual']:
    set2_paired.append(i)
    set2_paired.append(i)

#### Plot the grouped data points  
sns.set_style('whitegrid')
jvis.BarPlotByGroup_sbplot(MissedCleavages, x_col = 'group', y_col = '% Missed Cleavages', title = '% Missed Cleavages', pal = set2_paired)
<Figure size 432x288 with 0 Axes>

Results: Average

Digestion efficiency isn't great. Let's reduce the digestion volume by 0.5x and maintain the same trypsin concentration in future experiments

6. Remove Contaminants

Functions
jwrangle.MQ_getThreePassFilter( )
SeqIO.parse( )

After QC we no longer want the contaminants in our data. jwrangle.MQ_getThreePassFilter( ) will remove reverse peptides, contaminants, and only identified by site from MQ tables.
The filter will also accept customised exclusion lists in case users have added odd protein species to the search FASTA tables. In this particular experiment we added to the human FASTA, RNAse proteins and the large T antigen. The former as 1) a check that dynamic range is not being overwhelmed and 2) as an quantitative spike-in control to compare tryptic efficiency and the sample recovery across samples following C18 cleanup.

In [11]:
#### Map the location of the custom FASTA elements
os.listdir(base_path / 'my_resources' / 'FASTA')  

#### Create a list of the non-human proteins that were added to the custom FASTA genome search. 
new_cont = []
with open(base_path / 'my_resources' / 'FASTA' / "custom_proteome_elements.fasta", "r") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        new_cont.append(record.id.split('|')[1])

#### Remove all unwanted contaminants and IDs from the proteinGroups table      
pGroup_clean = jwrangle.MQ_getThreePassFilter(pGroups_map, custom_exclusion = new_cont)

#### Inspect the cleaned dataframe
pGroup_clean[['ENTREZGENE_gPro primary'] + [i for i in pGroup_clean.columns if 'iBAQ' in i]].head(2)
Out[11]:
ENTREZGENE_gPro primary iBAQ iBAQ 01_pH4.5_nCL-A iBAQ 10_pH5.5_nCL-D iBAQ 11_pH5.5_nCL-E iBAQ 12_pH5.5_nCL-F iBAQ 13_pH7.8_nCL-A iBAQ 14_pH7.8_nCL-B iBAQ 15_pH7.8_nCL-C iBAQ 16_pH7.8_nCL-D ... iBAQ 33_pH7.8_254-C iBAQ 34_pH7.8_254-D iBAQ 35_pH7.8_254-E iBAQ 36_pH7.8_254-F iBAQ 04_pH4.5_nCL-D iBAQ 05_pH4.5_nCL-E iBAQ 06_pH4.5_nCL-F iBAQ 07_pH5.5_nCL-A iBAQ 08_pH5.5_nCL-B iBAQ 09_pH5.5_nCL-C
0 HDLBP 1.965600e+08 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 16325000.0 9819400.0 16333000.0 21658000.0 0.0 0.0 0.0 0.0 0.0 0.0
1 RPS9 2.840400e+09 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 244410000.0 159950000.0 251630000.0 333440000.0 0.0 0.0 0.0 0.0 0.0 0.0

2 rows × 38 columns

7. Drop Gene Duplicates and Filter Intensities by LFQ

Functions
jinspect.MQ_dropDuplicateIDs( )

The next step focuses on improving confidence in the quality of our data. This is done by applying jinspect.MQ_dropDuplicateIDs( ) which has the below effects:

  • Because one gene can have many proteins, sometimes Maxquant will create multiple proteinGroups for a single gene. As most of our analysis focuses on genes we'll trim the lowest quality proteinGroups duplicates from the table.
  • Standard LFQ defaults require a minimum of 2 peptide species, at least one of which must be unique, for quantitation to be applied. Intensity and iBAQ values, however, do not have such a minimum limit. I consider a 2 peptide minimum to be a wise filter but still have use for the Intensity and iBAQ values. Thus where the LFQ filter is applied all measurements that do not meet the minimum limit will be discarded. In short, if there isn't a companion LFQ value, there won't be an Intensity or iBAQ value either after filtering.
  • It has been documented that Match Between Runs suffers a high frequency of false peptide transfers (Lim, Paulo, Gygi 2019; doi: 10.1021/acs.jproteome.9b00492). At the protein level, however, this false transfer rate is greatly mitigated by the minimum peptide rule applied by the LFQ algorithm. This is another good reason for our filtering step.
In [12]:
#### Drop duplicates and apply LFQ filter
filter_dict = jinspect.MQ_dropDuplicateIDs(pGroup_clean, metadata, prefix = 'Peptides', ID = 'ENTREZGENE_gPro primary', pool = 'measure', drop_ID = 'None', 
                                            keep_PoolCalcs = False, applyLFQ_filter = ['Intensity', 'iBAQ'])
#### Inspect filter dictionary
filter_dict.keys()
WARNING: jinspect.MQ_getMeasuredMeansByGroup() has not been checked
Out[12]:
dict_keys(['df_keep', 'df_droprows'])
In [13]:
#### The df_keep value contains our targets, df_droprows conatins the discarded duplicates. Assign the df_keep value to a new variable and inspect.
pGroup_filtered = filter_dict['df_keep']
pGroup_filtered.head(2)
Out[13]:
Protein IDs Majority protein IDs Peptide counts (all) Peptide counts (razor+unique) Peptide counts (unique) Protein names Gene names Fasta headers Number of proteins Peptides ... iBAQ 27_pH5.5_254-C iBAQ 28_pH5.5_254-D iBAQ 29_pH5.5_254-E iBAQ 30_pH5.5_254-F iBAQ 31_pH7.8_254-A iBAQ 32_pH7.8_254-B iBAQ 33_pH7.8_254-C iBAQ 34_pH7.8_254-D iBAQ 35_pH7.8_254-E iBAQ 36_pH7.8_254-F
0 A0A024R4E5;Q00341;Q00341-2;H0Y394;H7C0A4;C9JIZ... A0A024R4E5;Q00341;Q00341-2;H0Y394 52;51;47;36;17;15;14;12;11;11;10;10;8;6;6;5;5;... 52;51;47;36;17;15;14;12;11;11;10;10;8;6;6;5;5;... 52;51;47;36;17;15;14;12;11;11;10;10;8;6;6;5;5;... Vigilin HDLBP ;;; 23 52 ... 13159000.0 9700500.0 13087000.0 16089000.0 14684000.0 18556000.0 16325000.0 9819400.0 16333000.0 21658000.0
1 A0A024R4M0;P46781;B5MCT8;C9JM19;F2Z3C0;A8MXK4 A0A024R4M0;P46781;B5MCT8;C9JM19 15;15;12;12;3;3 15;15;12;12;3;3 15;15;12;12;3;3 40S ribosomal protein S9 RPS9 ;;; 6 15 ... 203450000.0 146150000.0 218280000.0 260780000.0 157770000.0 186250000.0 244410000.0 159950000.0 251630000.0 333440000.0

2 rows × 370 columns

8. Review Sample Clustering by Group

Functions
jtest.getDistanceMatrix( )
jvis.MQ_showDendrogramQC_mplplot( )

A distance matrix function jtest.getDistanceMatrix( ) is provided for users who wish to apply different algorithms or create different visualisations.
I like the 'ward' method for distance calculations and using a dengrogram to confirm that clustering matches expectations and so use a prerolled function jvis.MQ_showDendrogramQC_mplplot( )

In [14]:
#### Confirm that clustering matches expectations
jvis.MQ_showDendrogramQC_mplplot(pGroup_filtered, 'LFQ intensity', metadata, 'QC clustering: ', grid = 'YES', fsize = (8, 8))

Result: GOOD

Clustering reveals expected results

9. Analyse Normalisation Effects by Sample

Functions
jwrangle.Log2_ByPrefix( )
jwrangle.MQ_poolMulti( )
jvis.ViolinCompare_sbplot( )

Here we review normalisation effects on each sample within the condition groups; these are most easily interpreted after log2 transformation. We will transform all measures of interest with _jwrangle.Log2ByPrefix( ) and then pool all the values of interest, by condition, with _jwrangle.MQpoolMulti( ). The function _jvis.ViolinComparesbplot( ) will let use compare Intensity distribution on a per sample basis.

Normalisation is applied to LFQ values by MaxQuant and is a feature of its handling of label-free data. I've not seen a detailed explanation of how it works though so it is a leap of faith that Cox and Mann have selected an appropriate method.
Normalisation must be applied separately to nCL and cCL groups. This is unusual though necessary to avoid outrageous results caused by having groups with extreme differences. See expt.313 for evidence.

In [15]:
#### Log2 transform available intensity values.
pGroup_log2 = jwrangle.Log2_ByPrefix(pGroup_filtered, 'LFQ intensity')
pGroup_log2 = jwrangle.Log2_ByPrefix(pGroup_log2, 'iBAQ')
pGroup_log2 = jwrangle.Log2_ByPrefix(pGroup_log2, 'Intensity')
pGroup_log2.replace(0,np.nan, inplace=True)
In [16]:
#### Create a long form dataset for each desired grouping
pool_SampleIntensity = jwrangle.MQ_poolMulti(pGroup_log2, metadata, melt_list = ['Intensity', 'LFQ intensity'], group = 'condition')
pool_SampleIntensity.keys()
Out[16]:
dict_keys(['pH4.5_nCL', 'pH5.5_nCL', 'pH7.8_nCL', 'pH4.5_254', 'pH5.5_254', 'pH7.8_254'])
In [17]:
#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are 
#### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violations
sns.set_style('whitegrid')
jvis.ViolinCompare_sbplot(pool_SampleIntensity['pH4.5_nCL'], title = 'pH4.5 nCL: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = ['#ff6666', '#99ccff'])
In [18]:
#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are 
#### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violations
sns.set_style('whitegrid')
jvis.ViolinCompare_sbplot(pool_SampleIntensity['pH5.5_nCL'], title = 'pH5.5 nCL: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = ['#ff6666', '#99ccff'])
In [19]:
#### Inspect the Intensity shifts generated by the LFQ normalisation algorithm. In MaxQuant Raw Intensity and iBAQ values are 
#### not subjected to the LFQ normalisation calculations so this is a good way to spot any gross violations
sns.set_style('whitegrid')
jvis.ViolinCompare_sbplot(pool_SampleIntensity['pH7.8_nCL'], title = 'pH7.8 nCL: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = ['#ff6666', '#99ccff'])
In [20]:
jvis.ViolinCompare_sbplot(pool_SampleIntensity['pH4.5_254'], title = 'pH4.5 254: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set3_qual'])
In [21]:
jvis.ViolinCompare_sbplot(pool_SampleIntensity['pH5.5_254'], title = 'pH5.5 nCL: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set3_qual'])
In [22]:
jvis.ViolinCompare_sbplot(pool_SampleIntensity['pH7.8_254'], title = 'pH7.8 nCL: Normalisation Effects', ylabel = 'Log2(Intensity)', palette = cpal['Set3_qual'])

Result: GOOD

No weird sample variations
No outrageous normalisation adjustments

10. Compare Intensity Distribution and Sequence Coverage

Functions
jwrangle.MQ_poolDataByCondition( )
jvis.BoxPlotByColumn_sbplot( )

Next we will compare intensity and sequence coverage between groups. Log2 transformation has already been performed so we need only use jwrangle.MQ_poolDataByCondition( ) to create the appropriate long form dataset for plotting with jvis.BoxPlotByColumn_sbplot( ).

In [23]:
#### Pool data into a single long form dataset
pooled_dfDropGroupOne = jwrangle.MQ_poolDataByCondition(pGroup_log2, metadata_sort, prefix_list = ['Intensity', 'Sequence coverage'])
In [24]:
#### Compare Intensity distribution using a box and whisker plot
sns.set_style('whitegrid')
jvis.BoxPlotByColumn_sbplot(pooled_dfDropGroupOne, 'Intensity: ', 'Intensity')
In [25]:
#### Compare Sequence coverage using a box and whisker plot
sns.set_style('whitegrid')
jvis.BoxPlotByColumn_sbplot(pooled_dfDropGroupOne, 'Sequence coverage: ', 'Sequence coverage %')

Result: GOOD

These results are consistent with expectations.

11. Compare Sum Peptide Counts

Functions
jinspect.MQ_getSumBySample( )
jvis.BarPlotByGroup_sbplot( )

To sum the total peptides observed across all proteins use _jinspect.MQgetSumBySample( ). These sums will be returned as a modified metadata table.
Plotting these by group is easily done with jvis.BarPlotByGroup_sbplot( ). The plotting order is determined by the metadata ordering.
In this case we are inspecting the number of peptides detected after having removed contaminants- thus if some spike-in proteins were removed, i.e. in this case RNAse treatments, they will not contribute to the peptide count. To look at the replicability of these spike-ins, we would reach back to the 'df_droprows' table generated by jinspect.MQ_dropDuplicateIDs( ) in section 7.

In [26]:
#### Extract the total peptides observed per sample
metaStats = jinspect.MQ_getSumBySample(pGroup_log2, metadata_sort, freqList = ['Peptides'], measure = False)
metaStats
Out[26]:
experiment condition replicate sample measure MQgroups Peptides
0 05v2_1_pH4OdT_NC-A pH4.5_nCL A 01_pH4.5_nCL-A Intensity pH4.5 145.0
1 05v2_2_pH4OdT_NC-B pH4.5_nCL B 02_pH4.5_nCL-B Intensity pH4.5 109.0
2 05v2_3_pH4OdT_NC-C pH4.5_nCL C 03_pH4.5_nCL-C Intensity pH4.5 75.0
3 05v2_4_pH4OdT_NC-D pH4.5_nCL D 04_pH4.5_nCL-D Intensity pH4.5 54.0
4 05v2_5_pH4OdT_NC-E pH4.5_nCL E 05_pH4.5_nCL-E Intensity pH4.5 196.0
5 05v2_6_pH4OdT_NC-F pH4.5_nCL F 06_pH4.5_nCL-F Intensity pH4.5 58.0
6 05v2_19_pH4OdT_PC-A pH4.5_254 A 19_pH4.5_254-A Intensity pH4.5 5812.0
7 05v2_20_pH4OdT_PC-B pH4.5_254 B 20_pH4.5_254-B Intensity pH4.5 4299.0
8 05v2_21_pH4OdT_PC-C pH4.5_254 C 21_pH4.5_254-C Intensity pH4.5 5054.0
9 05v2_22_pH4OdT_PC-D pH4.5_254 D 22_pH4.5_254-D Intensity pH4.5 6127.0
10 05v2_23_pH4OdT_PC-E pH4.5_254 E 23_pH4.5_254-E Intensity pH4.5 6398.0
11 05v2_24_pH4OdT_PC-F pH4.5_254 F 24_pH4.5_254-F Intensity pH4.5 6244.0
12 05v2_7_pH5OdT_NC-A pH5.5_nCL A 07_pH5.5_nCL-A Intensity pH5.5 54.0
13 05v2_8_pH5OdT_NC-B pH5.5_nCL B 08_pH5.5_nCL-B Intensity pH5.5 66.0
14 05v2_9_pH5OdT_NC-C pH5.5_nCL C 09_pH5.5_nCL-C Intensity pH5.5 41.0
15 05v2_10_pH5OdT_NC-D pH5.5_nCL D 10_pH5.5_nCL-D Intensity pH5.5 54.0
16 05v2_11_pH5OdT_NC-E pH5.5_nCL E 11_pH5.5_nCL-E Intensity pH5.5 60.0
17 05v2_12_pH5OdT_NC-F pH5.5_nCL F 12_pH5.5_nCL-F Intensity pH5.5 40.0
18 05v2_25_pH5OdT_PC-A pH5.5_254 A 25_pH5.5_254-A Intensity pH5.5 4491.0
19 05v2_26_pH5OdT_PC-B pH5.5_254 B 26_pH5.5_254-B Intensity pH5.5 7085.0
20 05v2_27_pH5OdT_PC-C pH5.5_254 C 27_pH5.5_254-C Intensity pH5.5 8959.0
21 05v2_28_pH5OdT_PC-D pH5.5_254 D 28_pH5.5_254-D Intensity pH5.5 8553.0
22 05v2_29_pH5OdT_PC-E pH5.5_254 E 29_pH5.5_254-E Intensity pH5.5 8817.0
23 05v2_30_pH5OdT_PC-F pH5.5_254 F 30_pH5.5_254-F Intensity pH5.5 8999.0
24 05v2_13_pH8OdT_NC-A pH7.8_nCL A 13_pH7.8_nCL-A Intensity pH7.8 86.0
25 05v2_14_pH8OdT_NC-B pH7.8_nCL B 14_pH7.8_nCL-B Intensity pH7.8 80.0
26 05v2_15_pH8OdT_NC-C pH7.8_nCL C 15_pH7.8_nCL-C Intensity pH7.8 74.0
27 05v2_16_pH8OdT_NC-D pH7.8_nCL D 16_pH7.8_nCL-D Intensity pH7.8 90.0
28 05v2_17_pH8OdT_NC-E pH7.8_nCL E 17_pH7.8_nCL-E Intensity pH7.8 55.0
29 05v2_18_pH8OdT_NC-F pH7.8_nCL F 18_pH7.8_nCL-F Intensity pH7.8 93.0
30 05v2_31_pH8OdT_PC-A pH7.8_254 A 31_pH7.8_254-A Intensity pH7.8 9681.0
31 05v2_32_pH8OdT_PC-B pH7.8_254 B 32_pH7.8_254-B Intensity pH7.8 9784.0
32 05v2_33_pH8OdT_PC-C pH7.8_254 C 33_pH7.8_254-C Intensity pH7.8 10240.0
33 05v2_34_pH8OdT_PC-D pH7.8_254 D 34_pH7.8_254-D Intensity pH7.8 9840.0
34 05v2_35_pH8OdT_PC-E pH7.8_254 E 35_pH7.8_254-E Intensity pH7.8 9964.0
35 05v2_36_pH8OdT_PC-F pH7.8_254 F 36_pH7.8_254-F Intensity pH7.8 10469.0
In [27]:
#### Plot the sum peptides
sns.set_style('whitegrid')
jvis.BarPlotByGroup_sbplot(metaStats, x_col = 'condition', y_col = 'Peptides', title = 'Sum Peptides vs pH enrichment', pal = set2_paired,
                          errorbars = 'SEM')
<Figure size 432x288 with 0 Axes>

Results: Interesting

Interesting that recovery from the interphase is linearly associated with increasing pH.
The OdT hybridisation should not have been affected- the hybridisation conditions were similarly balanced after interphase recovery.

12. Compare Unique Gene Counts

Functions
jinspect.MQ_getFrequencyBySample( )

One gene can encode for many proteins that often share regions of similarity. As for illumina-based RNA-Seq, however, shotgun proteomics can rarely assign a peptide species to a singular protein. In MaxQuant these are called proteinGroups. Because we have do not require protein-specific results, and gene identity is more stable, our gene count describes the groups to which our detected proteins have been be assigned. Thus gene here is being detected by protein product, just as it would be detected by RNA product in RNA Seq; none of these 3 are synonymous. To be clear, this is a count and not a measure.

Gene frequency is defined by the summed observations per protein regardless of intensity value and this data is extracted to our modified metadata with jinspect.MQ_getFrequencyBySample( ) .
A typical MQ search will yield identical protein counts (though different values) for Intensity and iBAQ*. LFQ frequencies will vary depending on the search settings:

  • In this case the MQ search has set LFQ values to be calculated on a min 2 peptide ratio (this is the default)**

Notes
* Why protein counts should be identical I don't know. The original iBAQ paper stipulates rules for the inclusion of a protein in the iBAQ calculation but MaxQuant doesn't seem to apply them.
** Previously I tested LFQ min ratio at 1 peptide. At 1 minimum peptide there was unexpected QC clustering. Possible explanations for this are explained in section 7 and are cleaned up by jinspect.MQ_dropDuplicateIDs( ) function. We can expect this function to greatly reduce qualifying IDs (~20% fewer), especially in the QE samples, but I think the trade-off is worth it because we gain 1) a more robust ID check and 2) the same search can be used for LFQ based checks of dynamic changes, i.e. comparing more than one group of cCL captures for biological changes.

In [28]:
#### Count the number of unique 
metaStats = jinspect.MQ_getFrequencyBySample(pGroup_log2, metaStats, freqList = ['Intensity', 'iBAQ', 'LFQ intensity'], measure = False)
metaStats
Out[28]:
experiment condition replicate sample measure MQgroups Peptides Intensity iBAQ LFQ intensity
0 05v2_1_pH4OdT_NC-A pH4.5_nCL A 01_pH4.5_nCL-A Intensity pH4.5 145.0 30 30 30
1 05v2_2_pH4OdT_NC-B pH4.5_nCL B 02_pH4.5_nCL-B Intensity pH4.5 109.0 23 23 23
2 05v2_3_pH4OdT_NC-C pH4.5_nCL C 03_pH4.5_nCL-C Intensity pH4.5 75.0 16 16 16
3 05v2_4_pH4OdT_NC-D pH4.5_nCL D 04_pH4.5_nCL-D Intensity pH4.5 54.0 13 13 13
4 05v2_5_pH4OdT_NC-E pH4.5_nCL E 05_pH4.5_nCL-E Intensity pH4.5 196.0 64 64 64
5 05v2_6_pH4OdT_NC-F pH4.5_nCL F 06_pH4.5_nCL-F Intensity pH4.5 58.0 21 21 21
6 05v2_19_pH4OdT_PC-A pH4.5_254 A 19_pH4.5_254-A Intensity pH4.5 5812.0 632 632 632
7 05v2_20_pH4OdT_PC-B pH4.5_254 B 20_pH4.5_254-B Intensity pH4.5 4299.0 529 529 529
8 05v2_21_pH4OdT_PC-C pH4.5_254 C 21_pH4.5_254-C Intensity pH4.5 5054.0 570 570 570
9 05v2_22_pH4OdT_PC-D pH4.5_254 D 22_pH4.5_254-D Intensity pH4.5 6127.0 650 650 650
10 05v2_23_pH4OdT_PC-E pH4.5_254 E 23_pH4.5_254-E Intensity pH4.5 6398.0 690 690 690
11 05v2_24_pH4OdT_PC-F pH4.5_254 F 24_pH4.5_254-F Intensity pH4.5 6244.0 794 794 794
12 05v2_7_pH5OdT_NC-A pH5.5_nCL A 07_pH5.5_nCL-A Intensity pH5.5 54.0 13 13 13
13 05v2_8_pH5OdT_NC-B pH5.5_nCL B 08_pH5.5_nCL-B Intensity pH5.5 66.0 16 16 16
14 05v2_9_pH5OdT_NC-C pH5.5_nCL C 09_pH5.5_nCL-C Intensity pH5.5 41.0 22 22 22
15 05v2_10_pH5OdT_NC-D pH5.5_nCL D 10_pH5.5_nCL-D Intensity pH5.5 54.0 12 12 12
16 05v2_11_pH5OdT_NC-E pH5.5_nCL E 11_pH5.5_nCL-E Intensity pH5.5 60.0 11 11 11
17 05v2_12_pH5OdT_NC-F pH5.5_nCL F 12_pH5.5_nCL-F Intensity pH5.5 40.0 9 9 9
18 05v2_25_pH5OdT_PC-A pH5.5_254 A 25_pH5.5_254-A Intensity pH5.5 4491.0 513 513 513
19 05v2_26_pH5OdT_PC-B pH5.5_254 B 26_pH5.5_254-B Intensity pH5.5 7085.0 759 759 759
20 05v2_27_pH5OdT_PC-C pH5.5_254 C 27_pH5.5_254-C Intensity pH5.5 8959.0 853 853 853
21 05v2_28_pH5OdT_PC-D pH5.5_254 D 28_pH5.5_254-D Intensity pH5.5 8553.0 837 837 837
22 05v2_29_pH5OdT_PC-E pH5.5_254 E 29_pH5.5_254-E Intensity pH5.5 8817.0 858 858 858
23 05v2_30_pH5OdT_PC-F pH5.5_254 F 30_pH5.5_254-F Intensity pH5.5 8999.0 985 985 985
24 05v2_13_pH8OdT_NC-A pH7.8_nCL A 13_pH7.8_nCL-A Intensity pH7.8 86.0 13 13 13
25 05v2_14_pH8OdT_NC-B pH7.8_nCL B 14_pH7.8_nCL-B Intensity pH7.8 80.0 17 17 17
26 05v2_15_pH8OdT_NC-C pH7.8_nCL C 15_pH7.8_nCL-C Intensity pH7.8 74.0 14 14 14
27 05v2_16_pH8OdT_NC-D pH7.8_nCL D 16_pH7.8_nCL-D Intensity pH7.8 90.0 17 17 17
28 05v2_17_pH8OdT_NC-E pH7.8_nCL E 17_pH7.8_nCL-E Intensity pH7.8 55.0 13 13 13
29 05v2_18_pH8OdT_NC-F pH7.8_nCL F 18_pH7.8_nCL-F Intensity pH7.8 93.0 35 35 35
30 05v2_31_pH8OdT_PC-A pH7.8_254 A 31_pH7.8_254-A Intensity pH7.8 9681.0 941 941 941
31 05v2_32_pH8OdT_PC-B pH7.8_254 B 32_pH7.8_254-B Intensity pH7.8 9784.0 954 954 954
32 05v2_33_pH8OdT_PC-C pH7.8_254 C 33_pH7.8_254-C Intensity pH7.8 10240.0 960 960 960
33 05v2_34_pH8OdT_PC-D pH7.8_254 D 34_pH7.8_254-D Intensity pH7.8 9840.0 957 957 957
34 05v2_35_pH8OdT_PC-E pH7.8_254 E 35_pH7.8_254-E Intensity pH7.8 9964.0 971 971 971
35 05v2_36_pH8OdT_PC-F pH7.8_254 F 36_pH7.8_254-F Intensity pH7.8 10469.0 1103 1103 1103
In [29]:
#### Plot the counts
sns.set_style('whitegrid')
jvis.BarPlotByGroup_sbplot(metaStats, x_col = 'condition', y_col = 'Intensity', title = '# Genes Detected By Group', pal = set2_paired, ylabel = 'Unique Genes', errorbars = 'SEM')
<Figure size 432x288 with 0 Axes>

Results: GOOD

This experiment tested the effect of pH variation on the subsequent Qdt-mediated capture of RNP from the interphase

  • pH4.5 yields the fewest < PH5.5 < pH 7.8
  • All have clean nCL

13. Assess Replicate Correlation

Functions
jwrangle.MQ_getSliceByPrefix( )
jvis.showPearsonRegression_altair( )

The function _jwrangle.MQgetSliceByPrefix( ) provides a convenient means of extracting values of a specific group.
We can then use _jvis.showPearsonRegressionaltair( ) to perform pairwise comparisons between each member of those groups. This function is specifically applied to genes with shared intensities- genes exclusive to one sample or the other, represented by vertical or horizontal datapoints, are plotted but excluded from the pearson calculation.

In [30]:
#### Extract the intensity values as a dictionary where keys = groups
Intensity_Dict = jwrangle.MQ_getSliceByPrefix(pGroup_log2, metadata, 'Intensity', group = 'condition', add_col = None)
Intensity_Dict.keys()
Out[30]:
dict_keys(['pH4.5_nCL', 'pH5.5_nCL', 'pH7.8_nCL', 'pH4.5_254', 'pH5.5_254', 'pH7.8_254'])
In [31]:
#### Check replicate consistency across all within group pairs
jvis.showPearsonRegression_altair(Intensity_Dict['pH4.5_nCL'], mark_color = set2_paired[2])
jvis.showPearsonRegression_altair(Intensity_Dict['pH4.5_254'], mark_color = set2_paired[2])
jvis.showPearsonRegression_altair(Intensity_Dict['pH5.5_nCL'], mark_color = set2_paired[4])
jvis.showPearsonRegression_altair(Intensity_Dict['pH5.5_254'], mark_color = set2_paired[4])
jvis.showPearsonRegression_altair(Intensity_Dict['pH7.8_nCL'], mark_color = set2_paired[6])
jvis.showPearsonRegression_altair(Intensity_Dict['pH7.8_254'], mark_color = set2_paired[6])

Results: GOOD

Replicates look good among 254nm cCL samples. More varied among the sparse nCL samples which should be expected.

14. Classify RBP

Functions
jinspect.MQ_getFrequencyByGroup()
jtest.MQ_applyClassifyRBP()

Before classifying our RBP we need to first tally the frequency with which each protein appears in each condition using _jinspect.MQgetFrequencyByGroup( )
Once done, we use _jtest.MQapplyClassifyRBP( ) to generate a dictionary from which each class can be reviewed or plotted.

In [32]:
#### Use the metadata and proteinGroups tables to count how many times a gene is identified in its group (/6). 
#### Here I demonstrate how we can count for all instances of Intensity, iBAQ and LFQ Intensity
pGroup_Freq = jinspect.MQ_getFrequencyByGroup(pGroup_log2, metadata, 'iBAQ', group = 'condition')
pGroup_Freq = jinspect.MQ_getFrequencyByGroup(pGroup_Freq, metadata, 'LFQ intensity', group = 'condition')
pGroup_Freq = jinspect.MQ_getFrequencyByGroup(pGroup_Freq, metadata, 'Intensity', group = 'condition')
pGroup_Freq[['ENTREZGENE_gPro primary'] + [i for i in pGroup_Freq.columns if 'Freq' in i]].head(2)
Out[32]:
ENTREZGENE_gPro primary iBAQ Freq: pH4.5_nCL iBAQ Freq: pH5.5_nCL iBAQ Freq: pH7.8_nCL iBAQ Freq: pH4.5_254 iBAQ Freq: pH5.5_254 iBAQ Freq: pH7.8_254 LFQ intensity Freq: pH4.5_nCL LFQ intensity Freq: pH5.5_nCL LFQ intensity Freq: pH7.8_nCL LFQ intensity Freq: pH4.5_254 LFQ intensity Freq: pH5.5_254 LFQ intensity Freq: pH7.8_254 Intensity Freq: pH4.5_nCL Intensity Freq: pH5.5_nCL Intensity Freq: pH7.8_nCL Intensity Freq: pH4.5_254 Intensity Freq: pH5.5_254 Intensity Freq: pH7.8_254
0 HDLBP 0 0 0 6 6 6 0 0 0 6 6 6 0 0 0 6 6 6
1 RPS9 1 0 1 6 6 6 1 0 1 6 6 6 1 0 1 6 6 6
In [33]:
#### Now we can classify RBP; the console will report if all proteins/genes have been properly classified or not.
RBP_Dict_pH4 = jtest.MQ_applyClassifyRBP(pGroup_Freq, 'LFQ intensity', metadata, 'pH4.5', 'pH4.5_nCL', 'pH4.5_254', 3, 
                                         add_cols = ['ENTREZGENE_gPro all', 'ENTREZGENE_gPro primary', 'ENTREZGENE_gPro name'])

RBP_Dict_pH5 = jtest.MQ_applyClassifyRBP(pGroup_Freq, 'LFQ intensity', metadata, 'pH5.5', 'pH5.5_nCL', 'pH5.5_254', 3, 
                                         add_cols = ['ENTREZGENE_gPro all', 'ENTREZGENE_gPro primary', 'ENTREZGENE_gPro name'])

RBP_Dict_pH7 = jtest.MQ_applyClassifyRBP(pGroup_Freq, 'LFQ intensity', metadata, 'pH7.8', 'pH7.8_nCL', 'pH7.8_254', 3, 
                                         add_cols = ['ENTREZGENE_gPro all', 'ENTREZGENE_gPro primary', 'ENTREZGENE_gPro name'])
All proteins classified = True
All proteins classified = True
All proteins classified = True
In [34]:
#### The results of this classifcation can be found in a dictionary of dataframes for each output
#### See help() for an explanation of each dataset
RBP_Dict_pH7.keys()
Out[34]:
dict_keys(['Input_df_annStatus', 'Summary_df_annStatus', 'I', 'IIa', 'IIb', 'IIc', 'NC', 'ND'])
In [35]:
#### We want the most general overview which of our classes per treatment and so concatenate the results from each Summary_df_annStatus
RBP_Class = pd.concat([RBP_Dict_pH4['Summary_df_annStatus'], RBP_Dict_pH5['Summary_df_annStatus'], RBP_Dict_pH7['Summary_df_annStatus']], axis=0, join='outer')

#### And represent them in a barplot
from matplotlib import pyplot as plt

plt.figure('rbp class')
sns.set_style('whitegrid')
ax = sns.countplot(x='MQgroup', hue = 'RBP Class', data=RBP_Class, palette = cpal['RBP_Class'], edgecolor = 'black')
ax.set_ylabel('Unique Gene Count')
ax.set_title('Unique Genes detected per RBP Class')

RBP_Class.head(1)
Out[35]:
ENTREZGENE_gPro all ENTREZGENE_gPro primary ENTREZGENE_gPro name Gene names RBP Class RBP subClass MQgroup
0 HDLBP;None HDLBP high density lipoprotein binding protein [Sour... HDLBP I pH4.5
In [36]:
#### If we are to look more closely at Summary_df_annStatus we can see it includes the subclasses of the RBP class 2 group
RBP_Dict_pH4['Summary_df_annStatus'].head(10)
Out[36]:
ENTREZGENE_gPro all ENTREZGENE_gPro primary ENTREZGENE_gPro name Gene names RBP Class RBP subClass MQgroup
0 HDLBP;None HDLBP high density lipoprotein binding protein [Sour... HDLBP I pH4.5
1 RPS9 RPS9 ribosomal protein S9 [Source:HGNC Symbol;Acc:H... RPS9 II b pH4.5
2 YTHDF3 YTHDF3 YTH N6-methyladenosine RNA binding protein 3 [... YTHDF3 NC pH4.5
3 ELOA ELOA elongin A [Source:HGNC Symbol;Acc:HGNC:11620] TCEB3 NC pH4.5
5 MRPS21;None MRPS21 mitochondrial ribosomal protein S21 [Source:HG... MRPS21 NC pH4.5
6 DCAF13 DCAF13 DDB1 and CUL4 associated factor 13 [Source:HGN... DCAF13 I pH4.5
10 None;RPS29 RPS29 ribosomal protein S29 [Source:HGNC Symbol;Acc:... RPS29 NC pH4.5
11 None;UHRF1 UHRF1 ubiquitin like with PHD and ring finger domain... UHRF1 NC pH4.5
13 HNRNPDL HNRNPDL heterogeneous nuclear ribonucleoprotein D like... HNRNPDL I pH4.5
14 None;RPS24 RPS24 ribosomal protein S24 [Source:HGNC Symbol;Acc:... RPS24 II b pH4.5
In [37]:
#### Because each subclass of the class II RBP are identified based on different statistical assumptions we can more closely inspect if those assumptions trend differently among the different pH treatments.
plt.figure('rbp class')
sns.set_style('whitegrid')
ax = sns.countplot(x='MQgroup', hue = 'RBP subClass', data=RBP_Class[RBP_Class['RBP subClass']!=''].sort_values(by=['MQgroup','RBP subClass']), palette = cpal['Set3_qual'], edgecolor = 'black')
ax.set_ylabel('Unique Gene Count')
ax.set_title('Unique Genes detected per RBP Class')

RBP_Class.head(1)
Out[37]:
ENTREZGENE_gPro all ENTREZGENE_gPro primary ENTREZGENE_gPro name Gene names RBP Class RBP subClass MQgroup
0 HDLBP;None HDLBP high density lipoprotein binding protein [Sour... HDLBP I pH4.5

Results: GOOD

The number of class I binders rises steadily from pH 4.5 < pH 5.5 < pH 7.8.
Between classes I, II, NC the overwhleming majority of proteins in all treatment conditions are identified as high confidence, class I binders.
There is an interesting distribution difference of Class II subclasses between pH 4.5 + pH 7.8 vs pH 5.5. This differences includes a majority landing in the IIb category. These are for proteins detected in all cCL samples but also n=1 nCL sample. The reason for this difference is not clear but should be monitored in expt.297 to see if the same trend is confirmed.

15. Compare RBP Identity Between pH Treatments

Here we use the dictionaries output by jtest.MQ_applyClassifyRBP( ) for the purposes of creating a Venn Diagram.

In [38]:
from matplotlib import pyplot as plt
import numpy as np
from matplotlib_venn import venn3, venn3_circles

plt.figure(figsize=(6,6))

v = venn3([set(RBP_Dict_pH4['I']['ENTREZGENE_gPro primary']), 
           set(RBP_Dict_pH5['I']['ENTREZGENE_gPro primary']),
           set(RBP_Dict_pH7['I']['ENTREZGENE_gPro primary'])], 
           set_labels = ('pH4.5', 'pH5.5', 'pH7.8'),
           alpha = 0.5)

c = venn3_circles([set(RBP_Dict_pH4['I']['ENTREZGENE_gPro primary']), 
                   set(RBP_Dict_pH5['I']['ENTREZGENE_gPro primary']), 
                   set(RBP_Dict_pH7['I']['ENTREZGENE_gPro primary'])], linestyle='dashed')

c[0].set_lw(1.5)
c[0].set_ls('dotted')
c[1].set_lw(1.5)
c[1].set_ls('dotted')
c[2].set_lw(1.5)
c[2].set_ls('dotted')

for text in v.set_labels:
    text.set_fontsize(18)
for text in v.subset_labels:
    text.set_fontsize(14)

Results: GOOD

If there was any doubt that the different pH treatments might generate qualitatively different protein sets the near complete overlap shown above would dispel this.
Outwardly it would appear that all Class I identified RBPs fall within the next highest pH group and their total number expands with the same trend.

16. Compare RBP Isoelectric Points Between pH Treatments

Functions
jinspect.getIsoelectricPoints( )
jinspect.getIsoelectricPointsByClassDict( )
jwrangle.SplitList( )

Many algorithms exist for the calculation of isoelectric points (pI) by sequence analysis. Rather than calculating this data on the fly, I like to use a precalculated database published by Kozlowski 2016 and provided at isoelectricpointdb.org. This database provides precalculated pIs according to a variety of published models, and also an average pI which spans the most consisten algorithms. The function jinspect.getIsoelectricPointsByClassDict( ) will retrieve, as a list, all pIs associated with the proteins in our Class Dictionaries. The lists include all members of the Majority Protein Group that are associated with the identified gene; thus one gene may return more than one pI. This is a deliberate choice; because shotgun proteomics cannot properly discriminate between protein isoforms one sequence cannot be selected to solely represent a gene wihtout introducing user-bias. The majority protein groups is a best guess of the isoforms present and thus the pIs for each reported.

If users wish to curate their own list for pI reporting they should instead use jinspect.getIsoelectricPoints( ). To do so we'll use jwrangle.SplitList( ) to create a list of proteins from instrument QC data and then retrieve the relevant pI data.

In [39]:
#### Read in isoelectric point database (download from isoelectricpointdb.org)
dfisoelectric_point_database = pd.read_csv(base_path / 'downloads' / 'isoelectric_points' / '180822_HUMAN_isoelectricpointdb-org_edit.csv', index_col=False)

#### Calculate the pIs for every protein detected in each class
RBP_Dict_pH4_pI = jinspect.getIsoelectricPointsByClassDict(RBP_Dict_pH4, dfisoelectric_point_database)
RBP_Dict_pH5_pI = jinspect.getIsoelectricPointsByClassDict(RBP_Dict_pH5, dfisoelectric_point_database)
RBP_Dict_pH7_pI = jinspect.getIsoelectricPointsByClassDict(RBP_Dict_pH7, dfisoelectric_point_database)

#### Inspect the keys for each pI dictionary
RBP_Dict_pH7_pI.keys()
Out[39]:
dict_keys(['I_pI', 'IIa_pI', 'IIb_pI', 'IIc_pI', 'ND_pI', 'II_pI'])
In [40]:
#### For comparison we'll also import a proteins typically found in a whole proteome analysis  
#### To do this we'll grab some HeLa digest quality control data from the same instrument these experiments were run on
hela_qc = pd.read_csv(base_path / 'my_resources' / 'instrumentQC' / 'QE_191215' / 'proteinGroups.txt', delimiter='\t',index_col=False)

### Create a new list from the Majority protein IDs and get pIs
hela_qcProt = jwrangle.SplitList(hela_qc['Majority protein IDs'].tolist(), [';','-'], replace = '$&', remove_nonstring_values = True, drop_empty_strings = True)
hela_qc_pI = jinspect.getIsoelectricPoints(hela_qcProt, dfisoelectric_point_database)
hela_qc_pI.head(2)
Out[40]:
header sequence molecular_weight Bjellqvist DTASelect Dawson EMBOSS Grimsley IPC_peptide IPC_protein ... Patrickios ProMoST Rodwell Sillero Solomon Thurlkill Toseland Wikipedia Avg_pI edit: Uniprot ID
7369 >sp|P62891|RL39_HUMAN 60S ribosomal protein L3... MSSHKTFRIKRFLAKKQKQNRPIPQWIRMKTGNKIRYNSKRRHWR... 6406.68 12.559 12.559 12.574 13.056 12.603 13.056 12.398 ... 12.135 13.042 12.398 12.559 13.056 12.559 12.559 13.027 12.720 P62891
7614 >sp|P98179|RBM3_HUMAN RNA-binding protein 3 OS... MSSEEGKLFVGGLNFNTDEQALEDHFSSFGPISEVVVVKDRETQR... 17170.37 8.858 8.902 9.063 9.136 9.194 9.238 8.990 ... 4.202 8.697 9.033 8.975 9.253 8.829 8.741 9.355 9.002 P98179

2 rows × 22 columns

In [41]:
plt.figure('pH4.5 OdT RBP Classes: Isoelectric Points')

sns.set_style('whitegrid')

ax = sns.kdeplot(hela_qc_pI['Avg_pI'], label = 'Proteome', shade = True, alpha = 0.2)
ax = sns.kdeplot(RBP_Dict_pH4_pI['I_pI'], label = 'Class I')
ax = sns.kdeplot(RBP_Dict_pH4_pI['II_pI'], label = 'Class II')
ax = sns.kdeplot(RBP_Dict_pH4_pI['ND_pI'], label = 'Class ND')

ax.set_title('Isoelectric Point: RBP Class (pH4.5)', size = 16)
plt.ylabel('Density', size = 14)
plt.xlabel('pI', size = 14)
plt.legend()
plt.tight_layout()
sns.despine()
plt.close
Out[41]:
<function matplotlib.pyplot.close(fig=None)>
In [42]:
plt.figure('pH5.5 OdT RBP Classes: Isoelectric Points')

sns.set_style('whitegrid')

ax = sns.kdeplot(hela_qc_pI['Avg_pI'], label = 'Proteome', shade = True, alpha = 0.2)
ax = sns.kdeplot(RBP_Dict_pH5_pI['I_pI'], label = 'Class I')
ax = sns.kdeplot(RBP_Dict_pH5_pI['II_pI'], label = 'Class II')
ax = sns.kdeplot(RBP_Dict_pH5_pI['ND_pI'], label = 'Class ND')

ax.set_title('Isoelectric Point: RBP Class (pH5.5)', size = 16)
plt.ylabel('Density', size = 14)
plt.xlabel('pI', size = 14)
plt.legend()
plt.tight_layout()
sns.despine()
plt.close
Out[42]:
<function matplotlib.pyplot.close(fig=None)>
In [43]:
plt.figure('pH7.8 OdT RBP Classes: Isoelectric Points')

sns.set_style('whitegrid')

ax = sns.kdeplot(hela_qc_pI['Avg_pI'], label = 'Proteome', shade = True, alpha = 0.2)
ax = sns.kdeplot(RBP_Dict_pH7_pI['I_pI'], label = 'Class I')
ax = sns.kdeplot(RBP_Dict_pH7_pI['II_pI'], label = 'Class II')
ax = sns.kdeplot(RBP_Dict_pH7_pI['ND_pI'], label = 'Class ND')

ax.set_title('Isoelectric Point: RBP Class (pH7.8)', size = 16)
plt.ylabel('Density', size = 14)
plt.xlabel('pI', size = 14)
plt.legend()
plt.tight_layout()
sns.despine()
plt.close
Out[43]:
<function matplotlib.pyplot.close(fig=None)>

Results: GOOD

Here all 3 pH treatments produce the expected bias towards higher pI.
The changing distribution among the ND category as pH increases, however, suggests selectivity bias might exist. Indeed, using pH to separate macromolecules introduces the risk of selectivity bias in the data- especially where isoelectric points are used as evidence of RBP identity.
It is reasonable to assume that the original RIC protocol, being conducted entirely in neutral buffers will most accurately represent the expected pIs of RBPs as a group.

17. Compare RBP Class I Identity with Previous Studies

Functions
jwrangle.ListsToDataFrameSets( )
jwrangle.importMixedFiles( )

Comparing RBP candidates to those of other studies often relies on retrieving previously prepared lists and dataframes. For this we can use jwrangle.importMixedFiles() or read them directly from a resources folder (you'll need to set up your own).
After retrieving the relevant lists we can create upsetplots for inspecting data. The function jwrangle.ListsToDataFrameSets() helps here by creating multi index dataframes for plotting.
A number of cells are hashed out below. This is because they focus on once-only data wrangling of published data which, once complete, can be saved locally as a new copy and read in for future work. It is kept in this tutorial as an example of how functions such as jweb.mapAnyID_gPro( ) can be used on third party data.

In [44]:
#### Import the resources directory
# Hentze_2018 = jwrangle.importMixedFiles(base_path / 'downloads' / 'published_data' / '2018_Hentze_Nature_Reviews_s2table')
# Hentze_2018.keys()
In [45]:
#### In this example we want the Human Datasets
# Hentze_2018_Hs = Hentze_2018['Hs.csv']
# Hentze_2018_Hs.head(2)
In [46]:
#### That table produces an ugly import, let's fix up the headers and empty cells
# r1 = Hentze_2018_Hs.columns.tolist()
# r2 = Hentze_2018_Hs.loc[0,:].replace(np.nan, '').tolist()

# r3 = []
# for index, label in enumerate(r1):
#     r = label + ' (' + r2[index] + ')'
#     r = r.replace(' ()', '')
#     r3.append(r)
    
# Hentze_2018_Hs.columns = r3
# Hentze_2018_Hs = Hentze_2018_Hs.iloc[1:,:13]
# Hentze_2018_Hs.head(2)
In [47]:
#### To ensure naming consistency we will remap the ENSEMBL genes to ENTREZGENE.
#### I've prefer ENTREZGENE because it shares the same conventions as Uniprot, QuickGo, and MaxQuant.
# Hentze_2018_Hs_remap = jweb.mapAnyID_gPro(Hentze_2018_Hs['ID'].tolist(), splitstr = 'nosplit', geneProductType = 'gene', 
#                               gConvertOrganism = 'hsapiens', gConvertTarget = 'ENTREZGENE', writetopath = [cwd, 'Hentze_2018_Hs_remap'], writeTargetsAsList = 'NO')

#### For posterity, let's add the remapped gene names to the original dataframe and save it as an edited copy in our resources folder.
# Hentze_id_map = Hentze_2018_Hs_remap['id_map'].rename(columns={'Query': 'ID'}).drop_duplicates()
# Hentze_2018_Hs_edit = pd.merge(Hentze_2018_Hs, Hentze_id_map, on='ID', how='left')
# Hentze_2018_Hs_edit.to_csv(base_path / 'downloads' / 'published_data' / 'Hentze_2018_Hs_edit.csv', index = False )
# Hentze_2018_Hs_edit.head(2)
In [48]:
#### If not already loaded, read in the remapped resource data
Hentze_2018_Hs_edit = pd.read_csv(base_path / 'downloads' / 'published_data' / 'Hentze_2018_Hs_edit.csv')

#### We'll focus on the HEK293T RBPs identified by Baltz and begin by extracting the Gene names and remapping them to ensure consistency.
Baltz_HEK293_RIC = Hentze_2018_Hs_edit[Hentze_2018_Hs_edit['HEK293-RIC (Hs_Baltz2012)']=='YES']['ENTREZGENE_gPro primary']

#### Create upset plot to inspect overlap between the HEK293T RBPs identified by Baltz, and the class I RBPs discovered by our protocol at different pHs.
import upsetplot as upset

dict_list = {'Baltz_HEK293':list(set(Baltz_HEK293_RIC)),
             'pH4.5':list(set(RBP_Dict_pH4['I']['ENTREZGENE_gPro primary'])),
             'pH5.5':list(set(RBP_Dict_pH5['I']['ENTREZGENE_gPro primary'])),
             'pH7.8':list(set(RBP_Dict_pH7['I']['ENTREZGENE_gPro primary']))}

upsetDF = jwrangle.ListsToDataFrameSets(dict_list)
cols = upsetDF.columns.difference(['gene']).tolist()
group_upsetDF = upsetDF.groupby(cols).size()

plt.figure('Common Class I RBPs')
ax = upset.plot(group_upsetDF,sort_by='cardinality', show_counts = True,  facecolor= cpal['RBP_Class'][0], element_size =50)
plt.title('Class I RBPs shared with Baltz', size = 16)
plt.close
Out[48]:
<function matplotlib.pyplot.close(fig=None)>
<Figure size 432x288 with 0 Axes>
In [49]:
#### And what would the result be if we took all RBPs identified across the 7 studies?
dict_list = {'All RIC':list(set(Hentze_2018_Hs_edit['ENTREZGENE_gPro primary'])),
             'pH4.5':list(set(RBP_Dict_pH4['I']['ENTREZGENE_gPro primary'])),
             'pH5.5':list(set(RBP_Dict_pH5['I']['ENTREZGENE_gPro primary'])),
             'pH7.8':list(set(RBP_Dict_pH7['I']['ENTREZGENE_gPro primary']))}

upsetDF = jwrangle.ListsToDataFrameSets(dict_list)
cols = upsetDF.columns.difference(['gene']).tolist()
group_upsetDF = upsetDF.groupby(cols).size()

plt.figure('Common Class I RBPs')
ax = upset.plot(group_upsetDF,sort_by='cardinality', show_counts = True,  facecolor= cpal['Set1_qual'][3], element_size =50)
plt.title('Class I RBPs shared with Baltz', size = 16)
plt.close
Out[49]:
<function matplotlib.pyplot.close(fig=None)>
<Figure size 432x288 with 0 Axes>

Results: GOOD

The majority of our Class I RBPs match those found by Baltz et al. These upsetplots, however can be difficult to intpret.

Let's focus on the following features:

  • The Baltz only comparison (blue): ~200-500 genes not shared with Baltz, though most are. ~200 genes exclusive to Baltz.
  • The All RIC comparison (purple): ~600 genes from all RIC not found. Though this is not important. What is important, is that ~120 genes (from pH5.5 and pH 7.8) al not found in any of the All RIC studies. This is a pretty small number, giving confidence in the capture, but also a large enough number that we could investigate as novel RBPs. In addition, these can be investigated with relation to the silane capture.
  • I think we also need a Venn plot- overlap between pH and the Class I and IIs; also overlap between classes of the same pH.
  • Inspect overlaps with respect to those found in pairingsof the All RIC.

Some are new and some found by Baltz

18. Compare RBP Class I pI distribution to Previous Studies

Functions
jweb.mapAnyID_gPro( )
jinspect.getIsoelectricPoints( )
jwrangle.SplitList( )
jwrangle.importMixedFiles( )

Comparing RBP candidates to those of other studies often relies on retrieving previously prepared lists and dataframes. For this we can use jwrangle.importMixedFiles() or read them directly from a resources folder (you'll need to set up your own).
After retrieving the relevant lists we can create upsetplots for inspecting data. The function jwrangle.ListsToDataFrameSets() helps here by creating multi index dataframes for plotting.
A number of cells are hashed out below. This is because they focus on once-only data wrangling of published data which, once complete, can be saved locally as a new copy and read in for future work. It is kept in this tutorial as an example of how functions such as jweb.mapAnyID_gPro( ) can be used on third party data.

In [50]:
#### Lets also draw the isoelectric profile comparison against Baltz.  
#### To do so we'll need to fetch the Unprot protein names and 
Baltz_protList = jwrangle.SplitList(Hentze_2018_Hs_edit[Hentze_2018_Hs_edit['HEK293-RIC (Hs_Baltz2012)']=='YES']['ID'].tolist(), [';','-'], replace = '$&', remove_nonstring_values = True, drop_empty_strings = True)

#### Now we'll convert the ENSEMBL IDs from our resource table into Uniprot IDs and save the results locally
# Baltz_HEK293_RICprot = jweb.mapAnyID_gPro(Baltz_protList, splitstr = 'nosplit', geneProductType = 'gene', gConvertOrganism = 'hsapiens', gConvertTarget = 'UNIPROTSWISSPROT', writetopath = [cwd, 'Baltz_HEK293_RICprot'], writeTargetsAsList = 'NO')

#### If not already loaded, read in the mapped data and inspect
Baltz_HEK293_RICprot = jwrangle.importMixedFiles(cwd / 'downloads' / 'Baltz_HEK293_RICprot', dropSuffix = 'yes')
Baltz_HEK293_RICprot['id_map'].head(2)
Out[50]:
Query UNIPROTSWISSPROT_gPro all UNIPROTSWISSPROT_gPro primary UNIPROTSWISSPROT_gPro name
0 ENSG00000276072 None None None
1 ENSG00000275700 Q9NY61 Q9NY61 apoptosis antagonizing transcription factor [S...
In [51]:
### Now get pIs
Baltz_HEK293_RICprot = jinspect.getIsoelectricPoints(Baltz_HEK293_RICprot['id_map']['UNIPROTSWISSPROT_gPro primary'].tolist(), dfisoelectric_point_database)
Baltz_HEK293_RICprot.head(2)
Out[51]:
header sequence molecular_weight Bjellqvist DTASelect Dawson EMBOSS Grimsley IPC_peptide IPC_protein ... Patrickios ProMoST Rodwell Sillero Solomon Thurlkill Toseland Wikipedia Avg_pI edit: Uniprot ID
7614 >sp|P98179|RBM3_HUMAN RNA-binding protein 3 OS... MSSEEGKLFVGGLNFNTDEQALEDHFSSFGPISEVVVVKDRETQR... 17170.37 8.858 8.902 9.063 9.136 9.194 9.238 8.990 ... 4.202 8.697 9.033 8.975 9.253 8.829 8.741 9.355 9.002 P98179
7375 >sp|P62937|PPIA_HUMAN Peptidyl-prolyl cis-tran... MVNPTVFFDIAVDGEPLGRVSFELFADKVPKTAENFRALSTGEKG... 18012.49 7.688 7.819 7.702 7.966 6.898 7.819 7.088 ... 4.673 7.176 7.688 8.156 7.819 7.878 6.854 7.629 7.632 P62937

2 rows × 22 columns

In [52]:
#### Now plot the pI comparison vs Class I
plt.figure('RBP purification vs Baltz RIC\nIsoelectric Points')

sns.set_style('whitegrid')

ax = sns.kdeplot(hela_qc_pI['Avg_pI'], label = 'Proteome', shade = True, alpha = 0.2)
ax = sns.kdeplot(RBP_Dict_pH7_pI['I_pI'], label = 'Class I', color = 'green')
# ax = sns.kdeplot(RBP_Dict_pH7_pI['II_pI'], label = 'Class II')
ax = sns.kdeplot(Baltz_HEK293_RICprot['Avg_pI'], label = 'Baltz RIC RBP', color = 'red')

ax.set_title('RBP purification vs Baltz RIC\nIsoelectric Points', size = 16)
plt.ylabel('Density', size = 14)
plt.xlabel('pI', size = 14)
plt.legend()
plt.tight_layout()
sns.despine()
plt.close
Out[52]:
<function matplotlib.pyplot.close(fig=None)>

Results: GOOD

Strong agreement between our class I RBP and those of Baltz. Thus while there may not be complete agreement between specific identity, there is total agreement between shared characteristics of the discovered proteins.

19. Review Basic Gene Ontology

Functions
jwrangle.importMixedFiles( )
jweb.fetchQuickGO_stats( )

In this section we will explore Gene Ontology (GO) memberships for the observed proteins. There is little use in applying statistical tests such as Gene Ontology Enrichment Analysis (GOEA) for these experiments; the combination of selection by RNA interaction and the comparative lack of deep RBP validation for many candidates would make such a study rather spurious. We can, however, investigate the frequency with which our identified RBPs appear in previous studies. In addition, we can use this frequency to further assess whether the 20% and 30% capture conditions used here are equivalent or not.

A number of GO-specific and utility functions are provided to help with retrieving Gene Ontologies from the QuickGo database. In this section we'll look at the most basic.
The function jweb.fetchQuickGO_stats( ) will fetch the annotation statistics for all records belonging to the gene ID from a submitted list. These statistics are generally of two types: no. of annotations (which relates to unique proteins) and no. of geneProducts (which relate to the gene count; poorly worded I know). The quality of these records is given by whether identification originates from a SwissPort or a TrEMBL entry. Reviewing these statistics before beginning an analysis is ideal, because it contextualises the breadth of future analyses. This function returns these statistics in the from of a dictionary which can be converted to a dataframe for easier viewing by using the dedicated function jweb.getQuickGO_stats( ).

In addition to contextualising the search space of subsequent analyses, fetching the annotation numbers is important for checking that the number of records, per GO ID, falls below 10000. This is because QuickGo will not allow larger searches to be done programmatically. If your GO ID of interest has many more records users should retrieve their records manually. These details will be covered fourther in the next section.

In [53]:
#### I like to keep a table of interesting GO terms in a local csv file, let's find it
MyResources = jwrangle.importMixedFiles(base_path / 'my_resources')
MyResources.keys()
Out[53]:
dict_keys(['.ipynb_checkpoints', 'control_elements', 'control_proteins', 'custom_dfs', 'FASTA', 'GOexplore.ipynb', 'GO_TermsOfInterest.csv', 'GO_TermsOfInterest_stats.csv', 'Hentze_core_RBP_list.txt', 'Hentze_core_RBP_list_uniprot.txt', 'Hentze_core_RBP_plusE1_8.txt', 'instrumentQC', 'limma_normalizeTIs_p3550_p3592_p3657.nb.html', 'limma_normalizeTIs_p3550_p3592_p3657.Rmd', 'TIs_p3550_p3592_p3657.csv', 'TIs_p3550_p3592_p3657_loess.csv'])
In [54]:
MyResources['GO_TermsOfInterest.csv'].head(5)
Out[54]:
GO Term relationship depth go ID Description ctrl list type parent(s)
0 binding Molecular Function 1 GO:0005488 NaN NaN GO:0003674
1 protein binding Molecular Function 2 GO:0005515 NaN NaN GO:0005488
2 chromatin binding Molecular Function 2 GO:0003682 NaN NaN GO:0005488
3 nucleic acid binding Molecular Function 3 GO:0003676 NaN NaN GO:0003676
4 DNA binding Molecular Function 4 GO:0003677 NaN general GO:0003676
In [55]:
# #### Let's get basic statistics on the all GO IDs listed from depth 3 down"  
# MyGO_Stats_Dict = jweb.fetchQuickGO_stats(MyResources['GO_TermsOfInterest.csv']['go ID'].tolist()[3:], QG_geneProductType = 'protein', QG_taxonId = '9606', QG_geneProductSubset = ['Swiss-Prot', 'TrEMBL'])

# #### To view basic information, like the number of annotations associated with the GO term
# MyGO_Stats = jweb.getQuickGO_stats(MyGO_Stats_Dict)

# #### Save a local copy of the stats dataframe
# MyGO_Stats_DF = pd.merge(MyResources['GO_TermsOfInterest.csv'], MyGO_Stats, on = 'go ID', how = 'outer')
# MyGO_Stats_DF.to_csv(cwd / 'downloads' / 'MyGO_Stats_DF.csv', index = False)

# #### View the DataFrame
# MyGO_Stats_DF.head(50)
In [56]:
#### Read a local copy and view DataFrame
MyGO_Stats_DF = pd.read_csv(cwd / 'downloads' / 'MyGO_Stats_DF.csv')
MyGO_Stats_DF.head(50)
Out[56]:
GO Term relationship depth go ID Description ctrl list type parent(s) Swiss-Prot_annotations Swiss-Prot_genesProducts TrEMBL_annotations TrEMBL_genesProducts
0 binding Molecular Function 1 GO:0005488 NaN NaN GO:0003674 NaN NaN NaN NaN
1 protein binding Molecular Function 2 GO:0005515 NaN NaN GO:0005488 NaN NaN NaN NaN
2 chromatin binding Molecular Function 2 GO:0003682 NaN NaN GO:0005488 NaN NaN NaN NaN
3 nucleic acid binding Molecular Function 3 GO:0003676 NaN NaN GO:0003676 15714.0 4127.0 18207.0 9319.0
4 DNA binding Molecular Function 4 GO:0003677 NaN general GO:0003676 9399.0 2603.0 9854.0 4772.0
5 RNA binding Molecular Function 4 GO:0003723 NaN general GO:0003676 5245.0 1685.0 4976.0 2620.0
6 DNA/RNA hybrid binding Molecular Function 4 GO:0071667 NaN general GO:0003676 0.0 0.0 0.0 0.0
7 translation regulator activity Molecular Function 4 GO:0090079 NaN NaN GO:0003676 359.0 111.0 673.0 399.0
8 regulatory region nucleic acid binding Molecular Function 4 GO:0001067 NaN NaN GO:0003676 2866.0 1514.0 608.0 554.0
9 annealing activity Molecular Function 4 GO:0097617 NaN NaN GO:0003676 21.0 13.0 7.0 7.0
10 ssDNA binding Molecular Function 5 GO:0003697 NaN general GO:0003677 221.0 120.0 100.0 96.0
11 dsDNA binding Molecular Function 5 GO:0003690 NaN general GO:0003677 3780.0 1715.0 826.0 740.0
12 ssRNA binding Molecular Function 5 GO:0003727 NaN general GO:0003723 165.0 88.0 36.0 35.0
13 dsRNA binding Molecular Function 5 GO:0003725 NaN general GO:0003723 138.0 79.0 69.0 68.0
14 mRNA binding Molecular Function 5 GO:0003729 NaN general GO:0003723 573.0 302.0 277.0 246.0
15 pre-mRNA binding Molecular Function 5 GO:0036002 NaN general GO:0003723 48.0 38.0 21.0 17.0
16 snRNA binding Molecular Function 5 GO:0017069 NaN general GO:0003723 97.0 47.0 18.0 16.0
17 rRNA binding Molecular Function 5 GO:0019843 NaN general GO:0003723 95.0 64.0 72.0 61.0
18 rRNA primary transcript binding Molecular Function 6 GO:0042134 NaN NaN GO:0019843 3.0 1.0 0.0 0.0
19 5S rRNA binding Molecular Function 6 GO:0008097 NaN NaN GO:0019843 16.0 11.0 10.0 10.0
20 rRNA modification guide activity Molecular Function 6 GO:0030556 NaN NaN GO:0019843 0.0 0.0 0.0 0.0
21 large ribosomal subunity rRNA Molecular Function 6 GO:0070180 NaN NaN GO:0019843 5.0 5.0 0.0 0.0
22 small ribosomal subunit rRNA Molecular Function 6 GO:0070181 NaN NaN GO:0019843 10.0 10.0 0.0 0.0
23 B-WICH complex Molecular Function 6 GO:0110016 A chromatin remodeling complex that positively... NaN GO:0019843 0.0 0.0 0.0 0.0
24 5.8S rRNA binding Molecular Function 6 GO:1990932 NaN NaN GO:0019843 0.0 0.0 0.0 0.0
25 ribosome biogenesis Biological Process 4 GO:0042254 NaN ribozero GO:0022613 909.0 312.0 460.0 259.0
26 ribosome assembly Biological Process 5 GO:0042255 NaN ribozero GO:0042254 85.0 66.0 25.0 22.0
27 regulation of ribosome biogenesis Biological Process 5 GO:0090069 NaN NaN GO:0042254 19.0 16.0 2.0 2.0
28 positive regulation of ribosome biogenesis Biological Process 5 GO:0090070 NaN NaN GO:0042254 11.0 10.0 0.0 0.0
29 negative regulation of ribosome biogenesis Biological Process 5 GO:0090071 NaN NaN GO:0042254 5.0 5.0 1.0 1.0
30 ribosomal large subunit biogenesis Biological Process 5 GO:0042273 NaN NaN GO:0042254 122.0 74.0 49.0 29.0
31 ribosomal small subunit biogenesis Biological Process 5 GO:0042274 NaN NaN GO:0042254 124.0 73.0 18.0 17.0
32 rRNA export from the nucleus Biological Process 5 GO:0006407 NaN NaN GO:0042254 2.0 2.0 0.0 0.0
33 ribosomal subunit export from nucleus Biological Process 5 GO:0000054 NaN NaN GO:0042254 27.0 14.0 23.0 12.0
34 rRNA processing Biological Process 5 GO:0006364 NaN NaN GO:0042254 590.0 227.0 254.0 152.0
35 ribosome disassembly Biological Process 5 GO:0032790 NaN NaN GO:1903008 12.0 8.0 4.0 2.0
36 ribosome localization Biological Process 4 GO:0033750 NaN ribozero GO:0008104;GO:0051640 27.0 14.0 23.0 12.0
37 mitochondrial ribosome assembly Biological Process 6 GO:0061668 NaN ribozero GO:0042254 8.0 6.0 0.0 0.0
38 mitochondrial ribosome Cellular Component 10 GO:0005761 NaN ribozero GO:000313;GO:0005759 251.0 88.0 26.0 25.0
39 ribosome Cellular Component 8 GO:0005840 NaN ribozero GO:0043232 1400.0 234.0 1911.0 713.0
40 polysomal ribosome Cellular Component 9 GO:0042788 NaN NaN GO:0005840 35.0 33.0 3.0 3.0
41 organellar ribosome Cellular Component 9 GO:0000313 NaN NaN GO:0005840 251.0 88.0 26.0 25.0
42 cytosolic ribosome Cellular Component 9 GO:0022626 NaN NaN GO:0005840 295.0 112.0 31.0 31.0
43 structural constituent of ribosome Cellular Component 9 GO:0003735 NaN NaN GO:0005840 411.0 165.0 608.0 580.0
44 ribosomal subunit Cellular Component 9 GO:0044391 NaN NaN GO:0005840 555.0 190.0 138.0 135.0
45 RQC complex Cellular Component 2 GO:1990112 A multiprotein complex that forms a stable com... ribozero GO:0032991 3.0 2.0 6.0 3.0
46 ribonucleoprotein complex binding Molecular Function 3 GO:0043021 NaN ribozero GO:0044877 196.0 135.0 84.0 78.0
47 ribosome binding Molecular Function 4 GO:0043022 NaN ribozero GO:0043021 81.0 60.0 47.0 43.0
48 mitochondrial ribosome binding Molecular Function 5 GO:0097177 NaN ribozero GO:0043022 6.0 5.0 0.0 0.0
49 mitochondrion Cellular Component 5 GO:0005739 NaN NaN GO:0043231;GO:0005737 10259.0 1677.0 34454.0 13352.0

Results: GOOD

We can see that, with the exception of dsDNA binding, the population of members belonging to RNA-binding at depth 5 is quite sparse, especially compared to the depth 4 RNA binding category. mRNA-binding is by far the most populated group, most likely due to influences from previous OdT captures. I would expect, however, that annotations with experimental evidence are likely to be dramatically low- a good indication of the nascent state of RNA-binding protein research and one that should probably be emphasised in a future introduction.

From depth 4 onwards the SwissProt and TrEMBL counts are below 10000 records so, conveniently, we can continue the tutorial with the simplest scenario for record extraction.

20. Retrieve GO Records

Functions
jweb.fetchQuickGO( )
jwrangle.concatGO_DataFrameDict( )
jweb.mapQuickGO( )

We can see above that none of our downloaded IDs have exceeded our fetch limit of 10000 records. If any did, we would need to manually retrieve the tsv file.
For the GO terms we wish to investigate to investigate jweb.fetchQuickGO( ) will fetch the relevant proteins and also apply the same onsistent gene ID mapping algorithm we used earlier when remapping the MaxQuant IDs. This process is important because it allows us to provide consistent gene IDs to subsequent sets analysis. Where the writetopath options is used, the results from any searches will be saved to a local downloads folder- this is useful as a snapshot of the annotations used at the time or as a simple way of avoiding fetching the records via API in the future.

The function jwrangle.concatGO_DataFrameDict( ) let's us concatenate a dictionary created by jweb.fetchQuickGO( ). This is useful when the user wishes to pool both SwissProt and TrEMBL records for a given

A Note on Manual Downloads
The jweb.fetchQuickGO( ) will also output to console the html address that one should use if a manual download if required; this address will encompass all the relevant record characteristics that typically used by this suite. Once this table has been downloaded, the function jweb.mapQuickGO( ) can be used to provide the same ID mapping service as performed in jweb.fetchQuickGO( ). The returned elements will also be manipulated into the same format.

In [57]:
#### Download and create a local copy of all the records associated with our GO query from depth 4 onwards.  
# QuickGo_dict = jweb.fetchQuickGO(MyGO_Stats_DF['go ID'][4:], QG_geneProductType = 'protein', QG_taxonId = '9606', QG_geneProductSubset = ['Swiss-Prot', 'TrEMBL'], gConvertOrganism='hsapiens', writetopath= True)
In [58]:
#### To avoid future API requests we will load the saved files....
QuickGo_dict = jwrangle.importMixedFiles(cwd / 'downloads' / 'QuickGo')

#### ....and concatenate all SwissProt and TrEMBL entries into a single dictionary.
QuickGo_dict_concat = jwrangle.concatGO_DataFrameDict(QuickGo_dict)
QuickGo_dict_concat.keys()
Out[58]:
dict_keys(['GO0000054', 'GO0000313', 'GO0001067', 'GO0003677', 'GO0003690', 'GO0003697', 'GO0003723', 'GO0003725', 'GO0003727', 'GO0003729', 'GO0003735', 'GO0005739', 'GO0005761', 'GO0005840', 'GO0006364', 'GO0006407', 'GO0008097', 'GO0017069', 'GO0019843', 'GO0022626', 'GO0030556', 'GO0032790', 'GO0033750', 'GO0036002', 'GO0042134', 'GO0042254', 'GO0042255', 'GO0042273', 'GO0042274', 'GO0042788', 'GO0043021', 'GO0043022', 'GO0044391', 'GO0061668', 'GO0070180', 'GO0070181', 'GO0071667', 'GO0090069', 'GO0090070', 'GO0090071', 'GO0090079', 'GO0097177', 'GO0097617', 'GO0110016', 'GO1990112', 'GO1990932'])

Results: GOOD

All records sucessfully retrieved, saved locally, and a dictionary of concatenated SwissProt and TrEMBL entries has been generated.

21. Analyse GO Memberships #1: RNA-binding

Functions
jwrangle.AnnotateDataFrameCtrls( )
jinspect.MQ_getFrequencyBySample( )

The function jwrangle.AnnotateDataFrameCtrls( ) takes our QuickGo records and annotates each protein or gene in a given dataframe (i.e. a proteinGroups table) for membership to the searched GO ID. The function jinspect.MQ_getFrequencyBySample( ) acts similarly to the peptide and gene count functions used earlier- it will sum the frequency of memberships to each GO catgeory and report the counts as a modified metadata table.

In [59]:
#### With our GO records in hand we next investigate whether our identified proteins are members of our GO IDs
#### Let's start by looking at only GO:0003723. First, dedicate a dictionary to this search
dict_GO0003723 = {'GO0003723': QuickGo_dict_concat['GO0003723']}

#### Next, annotate the last version of our modified proteinGroups table  
pGroup_GO0003723 = jwrangle.AnnotateDataFrameCtrls(pGroup_Freq, dict_GO0003723, search_match = 'ENTREZGENE_gPro primary', dict_match = 'ENTREZGENE_gPro primary', none_col = 'GO_None')

#### Next find the counts for each GO ID being searched we will modify our metadata table in a fashion in the same way as for prptide and gene counting.
#### We'll use iBAQ for these counts but, given all intensities have previously been filtered by LFQ membership both 'Intensity' or 'LFQ intensity' would give the same result.
metaStatsGO0003723 = jinspect.MQ_getFrequencyBySample(pGroup_GO0003723['ann_df'], metaStats, freqList = list(dict_GO0003723.keys()) + ['GO_None'], measure = 'iBAQ')

#### Now Calculate the % GO0003723 members per group
metaStatsGO0003723['% RBP'] = metaStatsGO0003723.apply(lambda row: round(100*row['GO0003723']/(row['GO0003723']+row['GO_None'])), axis = 1)

#### Plot
jvis.BarPlotByGroup_sbplot(metaStatsGO0003723, x_col = 'condition', y_col = '% RBP', title = 'GO:0003723, RNA-Binding', pal = set2_paired, yrange = [0,110])
D:\MEGA\Programming\Scripts_JS\RBP_SUITE\modules\jwrangle.py:43: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ctrl_nohits_df[none_col] = True                                                                           #add zero annotation column to annotation columns
<Figure size 432x288 with 0 Axes>

Results: GOOD

Between 80% - 100% of each capture represents known RNA-binding proteins at pH 4.5 > pH5.5 > pH 7.8. A cursory analysis might take the view the pH 4.5 is best because this % is closest to 100%. We must remember that this % is calculated only on previously reported RBPs. Thus the pH7.8 group might:

  • contain a much higher % of RBPs than is reflected here; it simply holds many previously unreported RBPs.
  • contain RBPs that interact closely with DNA-RNA duplexes; thus these are dragged down into the lower phase of a pH4.5 cocktail.
  • contain a higher % of DNA binding proteins due to DNA capture. Quite possible, although recall that at pH5.5 DNA also concentrates to the interphase. Increased DNA capture also cannot be explained by downstream processing steps because the interphase extracts are rebalanced to the same pH afterwards. For now, the pH7.8 cocktail remains our best choice; it does not require pH rebalancing and is compatible with OdT hybidisation. Further method dev

21. Analyse GO Memberships #1: Nucleic Acid Binding

Functions
jwrangle.AnnotateDataFrameCtrls( )
jinspect.MQ_getFrequencyBySample( )
jinspect.MQ_getMeans( )

The function jwrangle.AnnotateDataFrameCtrls( ) takes our QuickGo records and annotates each protein or gene in a given dataframe (i.e. a proteinGroups table) for membership to the searched GO ID. The function jinspect.MQ_getFrequencyBySample( ) acts similarly to the peptide and gene count functions used earlier- it will sum the frequency of memberships to each GO catgeory and report the counts as a modified metadata table. Finally, the function jinspect.MQ_getMeans( ) will make it easy for us to calculate the means across any specific group in our metadata; essentially it acts like a native python 'groupby'.

In [60]:
#### Let's create a QuickGo dictionary of proteins/genes with nucleic acid binding properties (depth 5). These have been marked as 'general' under the 'ctrl list type' column of our GO_stats dataframe.
list_NA = MyGO_Stats_DF[MyGO_Stats_DF['ctrl list type']=='general']['go ID'].apply(lambda x: x.replace(':','')).tolist()

dict_NA = {}
for key, value in QuickGo_dict_concat.items():
    if key in list_NA:
        dict_NA[key] = value

#### Next, annotate the last version of our modified proteinGroups table  
pGroup_NA = jwrangle.AnnotateDataFrameCtrls(pGroup_Freq, dict_NA, search_match = 'ENTREZGENE_gPro primary', dict_match = 'ENTREZGENE_gPro primary', none_col = 'GO_None')

#### Next find the counts for each GO ID being searched we will modify our metadata table in a fashion in the same way as for prptide and gene counting.
#### We'll use LFQ intensity for these counts but, given all intensities have previously been filtered by LFQ membership both 'Intensity' or 'iBAQ' would give the same result.
metaStats_NA = jinspect.MQ_getFrequencyBySample(pGroup_NA['ann_df'], metaStats, freqList = list(dict_NA.keys()) + ['GO_None'], measure = 'LFQ intensity')

#### Calculate the means for our selected groups
means_NA = jinspect.MQ_getMeans(metaStats_NA, [i for i in metaStats_NA.columns if 'GO' in i], group = 'condition')

#### Relabel the GO IDs with names to make reading easier
MyGO_IDs = dict(zip([i.replace(':','') for i in MyGO_Stats_DF['go ID'].tolist()], MyGO_Stats_DF['GO Term'])) 
col_names = []
for i in list(means_NA.columns):
    if 'GO0' in i:
        for key, value in MyGO_IDs.items():
            if key in i:
                col_names.append(i.replace(key, value))
    else:
        col_names.append(i)
        
means_NA.columns = col_names 
means_NA
Out[60]:
DNA binding dsDNA binding ssDNA binding RNA binding dsRNA binding ssRNA binding mRNA binding snRNA binding rRNA binding pre-mRNA binding DNA/RNA hybrid binding GO_None
pH4.5_nCL 12 3 2 26 1 4 12 1 1 3 0 0
pH4.5_254 131 53 20 575 28 33 128 13 33 19 0 50
pH5.5_nCL 9 2 2 12 0 3 6 0 0 2 0 0
pH5.5_254 149 61 24 689 31 38 145 24 37 22 0 87
pH7.8_nCL 10 2 2 16 1 4 8 1 0 3 0 0
pH7.8_254 182 78 29 792 35 41 159 28 41 22 0 145
In [61]:
#### Use the native method .stack() to create a DataFrame for the stakced histogram
means_NA_stack = means_NA.stack().reset_index()
means_NA_stack.columns=['condition', 'GO code', 'mean count']
means_NA_stack = means_NA_stack.sort_values(['condition'])
means_NA_stack.head(3)

#### Chart Nucleic Acid GO memberships in a stacked histogram
chart = alt.Chart(means_NA_stack
        ).mark_bar(cornerRadiusBottomRight=0, cornerRadiusTopRight=0, stroke='black', fillOpacity=0.7, strokeWidth=0.5,
        ).encode(x='sum(mean count):Q', y='condition:N', color=alt.Color('GO code:N', scale=alt.Scale(scheme='Set3'))
        ).properties(height=200, width=400, title = 'GO memberships: Nucleic Acids')

chart.configure_title(fontSize = 15, fontWeight='normal'
                     ).configure_axis(labelFontSize=14, labelFontWeight='normal', titleFontSize = 14, titleFontWeight='normal'
                     ).configure_legend(titleFontSize = 14, titleFontWeight='normal', symbolStrokeWidth = 0.5, labelFontSize = 13)
Out[61]:

Result: GOOD

The counts are dominated by RNA binders, with some DNA binders present. Given the overall % RNA binders is very high, however, we can presume that these DNA binders are proteins that likely can also bind RNA. This is reinforced by the lack of DNA binders being present in the non-crosslinked samples.

Conclusion

The pH7.8 protocol appears ideal:

  • More proteins purified than any of the other cocktails
  • Strong RBP classification characteristics
  • High level of agreement with previously identified RBPs
  • Excellent pI characteristics, stable across classes I and II
  • pI matches that of Baltz RIC
  • ND classified proteins exhibit normal proteome pI- suggesting that pI selection bias is not present among the enriched fraction

This tutorial notebook will end here. Further functions available in RBP Suite will be explored in other notebooks.